fix: chain merge continuations instead of stopping after one pair#156
Open
Frank-Schruefer wants to merge 2 commits intodocling-project:mainfrom
Open
fix: chain merge continuations instead of stopping after one pair#156Frank-Schruefer wants to merge 2 commits intodocling-project:mainfrom
Frank-Schruefer wants to merge 2 commits intodocling-project:mainfrom
Conversation
predict_merges merged at most one successor per TEXT element: after recording merges[A] = [B] it set curr_ind = B and moved on, so C was processed as a fresh element rather than being appended to A's merge list. If A continues into B and B continues into C by the same criteria, all three belong to the same paragraph. Stopping at A→B and then C→D produces two short paragraphs instead of one. Fix: replace the single if-block with a while loop that keeps extending the merge chain from the current tail as long as the continuation condition holds. Also moves the skip-label list to a local variable and fixes the element-vs-label comparison in the skip loop (sorted_elements [ind_p1].label instead of sorted_elements[ind_p1]). Signed-off-by: stone <[email protected]>
Contributor
|
✅ DCO Check Passed Thanks @Frank-Schruefer, all your commits are properly signed off. 🎉 |
Contributor
Merge ProtectionsYour pull request matches the following merge protections and will not be merged until they are valid. 🟢 Enforce conventional commitWonderful, this rule succeeded.Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/
|
| ind_p1 < len(sorted_elements) | ||
| and sorted_elements[ind_p1].label == elem.label | ||
| and ( | ||
| elem.page_no != sorted_elements[ind_p1].label |
Member
There was a problem hiding this comment.
I guess this should be elem.page_no != sorted_elements[ind_p1].page_no
Signed-off-by: stone <[email protected]>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Problem
predict_mergesmerges at most one successor per TEXT element. After recordingmerges[A] = [B]it setscurr_ind = Band moves on, so C is processed as a fresh element rather than being appended to A's merge list.If A continues into B and B continues into C by the same criteria, all three belong to the same paragraph. The current code produces two short paragraphs (A+B and C+D) instead of one.
This becomes visible with PDFs that have many small orphan clusters — for example scanned pages with a native text layer, where each line of text is its own cluster. The pairwise limit leaves every other line as a separate paragraph, causing unnatural blank lines throughout the output.
Fix
Replace the single
if-block with awhileloop that keeps extending the merge chain from the current tail (check_ind) as long as the continuation condition holds:The existing
_readingorder_elements_to_docling_docalready iterates over the full merge list per element, so no changes are needed there.Also fixes the skip-loop element-vs-label comparison (
sorted_elements[ind_p1].label in skip_labelsinstead ofsorted_elements[ind_p1] in [...]which always evaluated to False — also submitted separately as #153).